Abstract

We investigate stationary hidden Markov processes for which mutual 
information between the past and the future is infinite. It is assumed that 
the number of observable states is finite and the number of hidden states is 
countably infinite. Under this assumption, we show that the block mutual 
information of a hidden Markov process is upper bounded by a power law 
determined by the tail index of the hidden state distribution. Moreover, 
we exhibit three examples of processes. The first example, considered 
previously, is nonergodic and the mutual information between the blocks 
is bounded by the logarithm of the block length. The second example 
is also nonergodic but the mutual information between the blocks obeys 
a power law. The third example obeys the power law and is ergodic. 
1 Introduction



In recent years there has been a surge of interdisciplinary interest in excess 
entropy, which is the Shannon mutual information between the past and the 
future of a stationary discrete-time process. The initial motivation for this 
interest was a paper by Hilberg who supposed that certain processes with 
infinite excess entropy may be useful for modeling texts in natural language. 
Subsequently, it was noticed that processes with infinite excess entropy appear
also in research on other so-called complex systems [?, 14, ?].
Also from a purely mathematical point of view, excess entropy is an interesting 
measure of dependence for nominal valued random processes, where the analysis 
of autocorrelation does not provide sufficient insight into process memory. 

Briefly reviewing earlier works, let us mention that excess entropy has
already been studied for several classes of processes. The most classical results
concern Gaussian processes, where Grenander and Szegő [20, Section 5.5] gave an
integral formula for excess entropy (in disguise) and Finch [18] evaluated this
formula for autoregressive moving average (ARMA) processes. In the ARMA
case excess entropy is finite. A few more papers concern processes over a finite
alphabet with infinite excess entropy. For instance, Bradley [?] constructed the
first example of a mixing process having this property. Gramss [19] investigated
a process which is formed by the frequencies of 0's and 1's in the rabbit sequence.
Travers and Crutchfield [26] researched some hidden Markov processes with a
countably infinite number of hidden states. Some attempts were also made to
generalize excess entropy to two-dimensional random fields [?, ?].

Excess entropy is an intuitive measure of memory stored in a stochastic 
process. Although this quantity only measures the memory capacity, without 
characterizing how the process future depends on the past, it can be given
interesting general interpretations. Mahoney, Ellison and Crutchfield [24], [15]
developed a formula for excess entropy in terms of predictive and retrodictive
$\epsilon$-machines, which are minimal unifilar hidden Markov representations of
the process [25], [23]. In our previous works [?, 11, 12, ?], we also investigated
excess entropy of stationary processes that model texts in natural language.
We showed that a power-law growth of mutual information between adjacent
blocks of text arises when the text describes certain facts in a logically consistent
and highly repetitive way. Moreover, if the mutual information between blocks
grows according to a power law then a similar power law is obeyed by the number
of distinct words, identified formally as codewords in a certain text compression
[7]. The latter power law is known as Herdan's law [21], which is an integral
version of the famous Zipf law observed for natural language [28].

In this paper we will study several examples of stationary hidden Markov
processes over a finite alphabet for which excess entropy is infinite. The first
study of such processes was developed by Travers and Crutchfield [26]. A few
more words about the adopted setting are in order. First, excess entropy is
finite for hidden Markov chains with a finite number of hidden states. This is
the usually studied case [16], for which the name of finite-state sources is also
used. To allow for hidden Markov processes with unbounded mutual
information, we need to assume that the number of hidden states is at least
countably infinite. Second, we want to restrict the class of studied models. If we
admitted an uncountable number of hidden states or a nonstationary distribution
over the hidden states then the class of hidden Markov processes would cover all
processes (over a countable alphabet). For that reason we will assume that the
underlying Markov process is stationary and the number of hidden states is
exactly countably infinite. In contrast, the number of observable states is fixed as
finite to focus on nontrivial examples. In all these assumptions we follow [26].

The modest aim of the present paper is to demonstrate that power-law 
growth of mutual information between adjacent blocks may arise for very simple 
hidden Markov processes. Presumably, stochastic processes which exhibit this 
power law appear in modeling of natural language [22, 10]. But the processes
that we study here do not have a clear linguistic interpretation. They are only 
mathematical instances presented to show what is possible in theory. Although 
these processes are simple to define, we perceive them as somewhat artificial
because of the way the memory of the past is stored in the present and
revealed in the future. Understanding what are acceptable mechanisms of mem- 
ory in realistic stochastic models of complex systems is an important challenge 
for future research. 

The further organization of the paper is as follows: In Section 2 we present
the results, whereas the proofs are deferred to Section 3.



2 Results 

Now we begin the formal presentation of our results. First, let
$(Y_i)_{i\in\mathbb{Z}}$ be a stationary Markov process on $(\Omega, \mathcal{J}, P)$
where variables $Y_i : \Omega \to \mathbb{Y}$ take values in a countably infinite
alphabet $\mathbb{Y}$. This process is called the hidden process. Next, for a
function $f : \mathbb{Y} \to \mathbb{X}$, where the alphabet
$\mathbb{X} = \{0, 1, \ldots, D-1\}$ is finite, we construct process
$(X_i)_{i\in\mathbb{Z}}$ with

$$X_i = f(Y_i). \tag{1}$$

Process $(X_i)_{i\in\mathbb{Z}}$ will be called the observable process. The process
is called unifilar if $Y_{i+1} = g(Y_i, X_{i+1})$ for a certain function
$g : \mathbb{Y} \times \mathbb{X} \to \mathbb{Y}$. Such a construction of hidden
Markov processes, historically the oldest one [2], is called state-emitting (or
Moore) in contrast to another construction named edge-emitting (or Mealy). The
Mealy construction, with a requirement of unifilarity, has been adopted in
previous works [?, ?, 23]. Here, we adopt the Moore construction and we drop the
unifilarity assumption since it leads to a simpler presentation of processes. It
should be noted that the standard definition of hidden Markov processes in
statistics and signal processing is still somewhat different, namely the observed
process $(X_i)_{i\in\mathbb{Z}}$ depends on the hidden process
$(Y_i)_{i\in\mathbb{Z}}$ via a probability distribution and $X_i$ is conditionally
independent of the other observables given $Y_i$. All the presented definitions
are, however, equivalent and the terminological discussion can be put aside.

Next we inspect the mutual information. Having entropy
$H(X) = \mathbf{E}[-\log P(X)]$ with $\log$ denoting the binary logarithm
throughout this paper, mutual information is defined as
$I(X;Y) = H(X) + H(Y) - H(X,Y)$. Here we will be interested in the block mutual
information of the observable process,

$$E(n) := I(X_{-n+1}^{0}; X_{1}^{n}), \tag{2}$$

where $X_k^l$ denotes the block $(X_i)_{k \le i \le l}$. More specifically, we are
interested in processes for which excess entropy $E = \lim_{n\to\infty} E(n)$ is
infinite and $E(n)$ diverges at a power-law rate. We want to show that such an
effect is possible for very simple hidden Markov processes. (Travers and
Crutchfield [26] considered some examples of nonergodic and ergodic hidden
Markov processes with infinite excess entropy but they did not investigate the
rate of divergence of $E(n)$.) Notice that by the data processing inequality for
the Markov process $(Y_i)_{i\in\mathbb{Z}}$, we have

$$E(n) \le I(Y_{-n+1}^{0}; Y_{1}^{n}) = I(Y_0; Y_1) \le H(Y_0). \tag{3}$$

Thus the block mutual information $E(n)$ may diverge only if the entropy of
the hidden state is infinite. To achieve this effect, the hidden variable $Y_0$ must
necessarily assume an infinite number of values.

Now we introduce our class of examples. Let us assume that hidden states
$\sigma_{nk}$ may be grouped into levels

$$T_n := \{\sigma_{nk}\}_{1 \le k \le r(n)} \tag{4}$$

that comprise equiprobable values. Moreover, we suppose that the level
indicator

$$N_i := n \iff Y_i \in T_n \tag{5}$$

has distribution

$$P(N_i = n) = \frac{C}{n \log^{\alpha} n}. \tag{6}$$

For $\alpha \in (1, 2]$, entropy $H(N_i)$ is infinite and so is $H(Y_i) \ge H(N_i)$
since $N_i$ is a function of $Y_i$. In the following, we work with this specific
distribution of $Y_i$.
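The slow divergence of $H(N_i)$ can be illustrated numerically. The following is a minimal sketch, not part of the original argument: the function names are hypothetical, and the constant $C$ is only approximated by a finite sum plus an integral estimate of the tail.

```python
import math

def level_weights(alpha, m_max):
    """Unnormalized weights 1 / (n * log2(n)**alpha) for n = 2..m_max."""
    return [1.0 / (n * math.log2(n) ** alpha) for n in range(2, m_max + 1)]

def normalizer(alpha, m_max):
    """Approximate C from (6): finite sum plus an integral estimate of the tail.

    Assumption: the tail beyond m_max is replaced by
    ln(2) / (alpha - 1) * log2(m_max)**(1 - alpha).
    """
    head = sum(level_weights(alpha, m_max))
    tail = math.log(2) / (alpha - 1) * math.log2(m_max) ** (1 - alpha)
    return 1.0 / (head + tail)

def partial_entropy(alpha, m_max, c):
    """Sum of -p * log2(p) over levels 2..m_max, with p = c / (n * log2(n)**alpha)."""
    total = 0.0
    for w in level_weights(alpha, m_max):
        p = c * w
        total -= p * math.log2(p)
    return total

if __name__ == "__main__":
    alpha = 1.5
    c = normalizer(alpha, 10**5)
    for m_max in (10**2, 10**3, 10**4, 10**5):
        print(m_max, partial_entropy(alpha, m_max, c))
```

The partial sums keep growing as the truncation point increases, which is consistent with an infinite entropy, although a finite computation of this kind cannot prove divergence.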

As we will show, the rate of growth of the block mutual information $E(n)$ is
bounded in terms of exponent $\alpha$ from equation (6). Let us write
$f(n) = O(g(n))$ if $f(n) \le K g(n)$ for a $K > 0$ and $f(n) = \Theta(g(n))$ if
$K_1 g(n) \le f(n) \le K_2 g(n)$ for $K_1, K_2 > 0$.

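Theorem 1 Assume that $\mathbb{Y} = \{\sigma_{nk}\}_{1 \le k \le r(n),\, n \ge 2}$,
where function $r(n)$ satisfies $r(n) = O(n^p)$ for a $p \in \mathbb{N}$. Moreover
assume that

$$P(Y_i = \sigma_{nk}) = \frac{1}{r(n)} \cdot \frac{C}{n \log^{\alpha} n}, \tag{7}$$

where $\alpha \in (1, 2]$ and $C^{-1} = \sum_{n=2}^{\infty} (n \log^{\alpha} n)^{-1}$.
Then we have

$$E(n) = \begin{cases} O(n^{2-\alpha}), & \alpha \in (1, 2), \\ O(\log n), & \alpha = 2. \end{cases} \tag{8}$$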

The interesting question becomes whether there exist hidden Markov
processes that achieve the upper bound established in Theorem 1. If so, can they
be ergodic? The answer to both questions is positive and we will exhibit some
simple examples of such processes.

The first example that we present is nonergodic and the mutual information
diverges slower than expected from Theorem 1.
Example 1 (Heavy Tailed Periodic Mixture I) This example has been
introduced in [26]. We assume
$\mathbb{Y} = \{\sigma_{nk}\}_{1 \le k \le r(n),\, n \ge 2}$, where $r(n) = n$. Then
we set the transition probabilities

$$P(Y_{i+1} = \sigma_{nk} \mid Y_i = \sigma_{ml}) = \begin{cases} \mathbf{1}\{n = m,\, k = l + 1\}, & 1 \le l \le m - 1, \\ \mathbf{1}\{n = m,\, k = 1\}, & l = m. \end{cases} \tag{9}$$

We can see that the transition graph associated with the process
$(Y_i)_{i\in\mathbb{Z}}$ consists of disjoint cycles on levels $T_n$. The stationary
distribution of the Markov process is not unique and the process is nonergodic if
more than one cycle has a positive probability. Here we assume the cycle
distribution (6) so the stationary marginal distribution of $Y_i$ equals (7).
Moreover, the observable process is set as

$$X_i = \begin{cases} 0, & Y_i = \sigma_{nk},\ 1 \le k \le n - 1, \\ 1, & Y_i = \sigma_{nn}. \end{cases} \tag{10}$$
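To make the construction concrete, the following sketch generates observable symbols of Example 1. It is an illustration under stated assumptions: the names `example1_block` and `sample_level` are hypothetical, and the level sampler truncates the distribution (6) at a finite `max_level`, which the example itself does not do.

```python
import math
import random

def example1_block(level, phase, length):
    """Observable symbols of Example 1 on the cycle of the given level.

    `phase` is the 1-based index k of the hidden state sigma_{level,k}
    at the first emitted position. The symbol 1 marks the end of a period.
    """
    out = []
    k = phase
    for _ in range(length):
        out.append(1 if k == level else 0)
        k = 1 if k == level else k + 1
    return out

def sample_level(alpha, max_level, rng):
    """Draw a level with weights 1 / (n * log2(n)**alpha), truncated at max_level.

    Truncation is an assumption made only so that sampling is feasible.
    """
    levels = list(range(2, max_level + 1))
    weights = [1.0 / (n * math.log2(n) ** alpha) for n in levels]
    return rng.choices(levels, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    level = sample_level(1.5, 50, rng)
    phase = rng.randint(1, level)   # uniform phase within the cycle
    print(level, example1_block(level, phase, 3 * level))
```

Once the level is drawn, the path is a deterministic periodic sequence, so all randomness of the process sits in the choice of the cycle and the phase.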

In the above example, the level indicator $N_i$ has infinite entropy and is
measurable with respect to the shift-invariant algebra of the observable process
$(X_i)_{i\in\mathbb{Z}}$. Hence $E(n)$ tends to infinity by the ergodic decomposition
of excess entropy [8, Theorem 5]. A more precise bound on the block mutual
information is given below.

Proposition 1 For Example 1 we have

$$E(n) = \begin{cases} \Theta(\log^{2-\alpha} n), & \alpha \in (1, 2), \\ \Theta(\log \log n), & \alpha = 2. \end{cases} \tag{11}$$

The next example is also nonergodic but the rate of mutual information 
reaches the upper bound. It seems to happen so because the information about 
the hidden state level is coded in the observable process in a more concise way. 

Example 2 (Heavy Tailed Periodic Mixture II) We assume that
$\mathbb{Y} = \{\sigma_{nk}\}_{1 \le k \le r(n),\, n \ge 2}$, where $r(n) = s(n)$ is the
length of the binary expansion of number $n$. Then we set the transition
probabilities

$$P(Y_{i+1} = \sigma_{nk} \mid Y_i = \sigma_{ml}) = \begin{cases} \mathbf{1}\{n = m,\, k = l + 1\}, & 1 \le l \le s(m) - 1, \\ \mathbf{1}\{n = m,\, k = 1\}, & l = s(m). \end{cases} \tag{12}$$

Again, the transition graph associated with the process $(Y_i)_{i\in\mathbb{Z}}$
consists of disjoint cycles on levels $T_n$. As previously, we assume the cycle
distribution (6) and the marginal distribution (7). Moreover, let $b(n, k)$ be the
$k$-th digit of the binary expansion of number $n$. (We have $b(n, 1) = 1$.) The
observable process is set as

$$X_i = \begin{cases} 2, & Y_i = \sigma_{n1}, \\ b(n, k), & Y_i = \sigma_{nk},\ 2 \le k \le s(n). \end{cases} \tag{13}$$
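The coding used in Example 2 can be sketched as follows. The helper names are hypothetical; the only ingredients taken from the text are $s(n)$, $b(n, k)$, and the rule (13). The sketch shows that one full cycle of length $s(n)$ suffices to recover the level $n$, whereas Example 1 needs a cycle of length $n$.

```python
def s(n):
    """Length of the binary expansion of n."""
    return n.bit_length()

def b(n, k):
    """k-th digit (1-based, most significant first) of the binary expansion of n."""
    return (n >> (s(n) - k)) & 1

def example2_cycle(n):
    """One period of the observable process on level n, following (13)."""
    return [2] + [b(n, k) for k in range(2, s(n) + 1)]

def decode_level(cycle):
    """Recover n from one full period; the delimiter 2 stands for b(n, 1) = 1."""
    if not cycle or cycle[0] != 2:
        raise ValueError("a period must start with the delimiter 2")
    value = 1
    for bit in cycle[1:]:
        value = 2 * value + bit
    return value

if __name__ == "__main__":
    for n in (2, 5, 12):
        print(n, example2_cycle(n), decode_level(example2_cycle(n)))
```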
Proposition 2 For Example 2 we have

$$E(n) = \begin{cases} \Theta(n^{2-\alpha}), & \alpha \in (1, 2), \\ \Theta(\log n), & \alpha = 2. \end{cases} \tag{14}$$
In the third example the rate of mutual information also reaches the upper
bound and the process is additionally ergodic. The process resembles the
Branching Copy (BC) process introduced in [26]. There are three main
differences between the BC process and our process. First, we discuss a simpler
nonunifilar presentation of the process rather than a more complicated unifilar
one. Second, we add strings of $s(m) + 1$ separator symbols 3 in the observable
process. Third, we put slightly different transition probabilities to obtain a
simpler stationary distribution. All these changes lead to a simpler computation of
mutual information.

Example 3 (Heavy Tailed Mixing Copy) Let
$\mathbb{Y} = \{\sigma_{nk}\}_{1 \le k \le r(n),\, n \ge 2}$ with $r(n) = 3s(n)$ and
$s(n)$ being the length of the binary expansion of number $n$. Then we set the
transition probabilities

$$P(Y_{i+1} = \sigma_{nk} \mid Y_i = \sigma_{ml}) = \begin{cases} \mathbf{1}\{n = m,\, k = l + 1\}, & 1 \le l \le r(m) - 1, \\ p(n)\mathbf{1}\{k = 1\}, & l = r(m), \end{cases} \tag{15}$$

where

$$p(n) = \frac{1}{r(n)} \cdot \frac{D}{n \log^{\alpha} n} \tag{16}$$

and $D^{-1} = \sum_{n=2}^{\infty} (r(n) \cdot n \log^{\alpha} n)^{-1}$. This time levels
$T_n$ communicate through transitions $\sigma_{m r(m)} \to \sigma_{n1}$ happening
with probabilities $p(n)$. The transition graph of the process
$(Y_i)_{i\in\mathbb{Z}}$ is strongly connected and there is a unique stationary
distribution. Hence the process is ergodic. It can be easily verified that the
stationary distribution is (7) so the levels are distributed according to (6). As
previously, let $b(n, k)$ be the $k$-th digit of the binary expansion of number $n$.
The observable process is set as

$$X_i = \begin{cases} 2, & Y_i = \sigma_{n1}, \\ b(n, k), & Y_i = \sigma_{nk},\ 2 \le k \le s(n), \\ 3, & Y_i = \sigma_{nk},\ s(n) + 1 \le k \le 2s(n) + 1, \\ b(n, k - 2s(n)), & Y_i = \sigma_{nk},\ 2s(n) + 2 \le k \le 3s(n). \end{cases} \tag{17}$$
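The emission rule (17) can be sketched in a few lines. The function name `example3_level_block` is hypothetical; it returns the $3s(n)$ observable symbols emitted during one pass through level $n$: the delimiter 2, the digits $b(n, 2), \ldots, b(n, s(n))$, a run of $s(n) + 1$ separators 3, and a second copy of the digits.

```python
def example3_level_block(n):
    """Observable symbols emitted while the hidden state traverses level n, per (17)."""
    sn = n.bit_length()                      # s(n), length of the binary expansion
    bits = [(n >> (sn - k)) & 1 for k in range(2, sn + 1)]   # b(n, 2..s(n))
    return [2] + bits + [3] * (sn + 1) + bits

def example3_path(levels):
    """Concatenate the blocks of consecutively visited levels (levels given as input)."""
    path = []
    for n in levels:
        path.extend(example3_level_block(n))
    return path

if __name__ == "__main__":
    print(example3_level_block(5))
    print(example3_path([5, 6, 2]))
```

In the process itself the next level is drawn with probability $p(n)$ after each pass; here the sequence of levels is supplied by the caller to keep the sketch deterministic.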

Proposition 3 For Example 3, $E(n)$ satisfies (14).



Summarizing our results, we make the following comment. The power-law growth
of block mutual information has been previously considered a hallmark of
stochastic processes that model "complex behavior", such as texts in natural
language [?]. However, the constructed examples of hidden Markov processes
feature quite simple transition graphs. Consequently, one may doubt whether
power-law growth of mutual information is a sufficient reason to call a given
stochastic process a model of complex behavior, even when we restrict the class
of processes to processes over a finite alphabet. Based on our experience with
other processes with rapidly growing block mutual information [?, 11, 12, ?],
which are more motivated linguistically, we think that infinite excess entropy is
just one of the necessary conditions. Identifying other conditions for stochastic
models of complex systems is a matter of further interdisciplinary research. We
believe that these conditions depend on the particular system to be modeled.
3 Proofs

We begin with two simple bounds. 

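Lemma 1 Let $\alpha \in (1, 2]$. On the one hand, we have

$$\sum_{m=2}^{n} \frac{1}{m \log^{\alpha-1} m} = \delta_1 + \begin{cases} \dfrac{\ln 2}{2 - \alpha} \left( \log^{2-\alpha} n - 1 \right), & \alpha \in (1, 2), \\ (\ln^2 2) \log \log n, & \alpha = 2, \end{cases} \tag{18}$$

where $0 \le \delta_1 \le 1/2$. On the other hand, we have

$$\sum_{m=n}^{\infty} \frac{1}{m \log^{\alpha} m} = \delta_2 + \frac{\ln 2}{\alpha - 1} \log^{1-\alpha} n, \tag{19}$$

where $0 \le \delta_2 \le (n \log^{\alpha} n)^{-1}$.

Proof: For a continuous decreasing function $f$ we have
$\int_a^b f(m)\,dm \le \sum_{m=a}^{b} f(m) \le f(a) + \int_a^b f(m)\,dm$. Moreover,
substituting $p = \log m$,

$$\int_2^n \frac{dm}{m \log^{\alpha-1} m} = \int_1^{\log n} \frac{(\ln 2)\,dp}{p^{\alpha-1}} = \begin{cases} \dfrac{\ln 2}{2 - \alpha} \left( \log^{2-\alpha} n - 1 \right), & \alpha \in (1, 2), \\ (\ln^2 2) \log \log n, & \alpha = 2, \end{cases}$$

$$\int_n^{\infty} \frac{dm}{m \log^{\alpha} m} = \int_{\log n}^{\infty} \frac{(\ln 2)\,dp}{p^{\alpha}} = \frac{\ln 2}{\alpha - 1} \log^{1-\alpha} n.$$

Hence the claims follow. □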
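The first bound of Lemma 1 can be checked numerically. The sketch below uses hypothetical function names and computes the remainder $\delta_1$ in (18) as the difference between the finite sum and the closed-form main term, with $\log$ taken as the binary logarithm as elsewhere in the paper.

```python
import math

def lemma_sum(alpha, n):
    """Left-hand side of (18): sum over m = 2..n of 1 / (m * log2(m)**(alpha - 1))."""
    return sum(1.0 / (m * math.log2(m) ** (alpha - 1)) for m in range(2, n + 1))

def lemma_main_term(alpha, n):
    """Closed-form main term on the right-hand side of (18)."""
    if alpha == 2:
        return math.log(2) ** 2 * math.log2(math.log2(n))
    return math.log(2) / (2 - alpha) * (math.log2(n) ** (2 - alpha) - 1)

def delta1(alpha, n):
    """Remainder delta_1 = sum - main term; Lemma 1 states 0 <= delta_1 <= 1/2."""
    return lemma_sum(alpha, n) - lemma_main_term(alpha, n)

if __name__ == "__main__":
    for alpha in (1.2, 1.5, 2):
        for n in (2, 10, 1000, 100000):
            print(alpha, n, delta1(alpha, n))
```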



For an event $B$, let us introduce conditional entropy $H(X|B)$ and mutual
information $I(X;Y|B)$, which are respectively the entropy of variable $X$ and
mutual information between variables $X$ and $Y$ taken with respect to probability
measure $P(\cdot|B)$. The conditional entropy $H(X|Z)$ and information $I(X;Y|Z)$
for a variable $Z$ are the averages of expressions $H(X|Z = z)$ and $I(X;Y|Z = z)$
taken with weights $P(Z = z)$. That is the received knowledge. Now comes
a handy fact that we will also use. Let $I_B$ be the indicator function of event $B$.
Observe that

$$\begin{aligned} I(X;Y) &= I(X;Y|I_B) + I(X;Y;I_B) \\ &= P(B)\,I(X;Y|B) + P(B^c)\,I(X;Y|B^c) + I(X;Y;I_B), \end{aligned} \tag{20}$$

where the triple information $I(X;Y;I_B)$ satisfies
$|I(X;Y;I_B)| \le H(I_B) \le 1$ by the information diagram [27].
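Identity (20) and the bound on the triple information can be checked on a small example. The sketch below is an illustration only: the joint distribution of $(X, Y, I_B)$ is an arbitrary toy table, all function names are hypothetical, and the triple information is computed as $I(X;Y) - I(X;Y|I_B)$.

```python
import math
from itertools import product

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axes):
    """Marginal pmf of the coordinates listed in `axes`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[a] for a in axes)
        out[sub] = out.get(sub, 0.0) + p
    return out

def mutual_information(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B) for coordinate tuples a and b."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))

def conditional_mi(joint, a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(marginal(joint, a + c)) + entropy(marginal(joint, b + c))
            - entropy(marginal(joint, a + b + c)) - entropy(marginal(joint, c)))

def condition_on(joint, axis, value):
    """Return P(coordinate = value) and the pmf conditioned on that event."""
    mass = sum(p for key, p in joint.items() if key[axis] == value)
    return mass, {k: p / mass for k, p in joint.items() if k[axis] == value}

def toy_joint():
    """Arbitrary strictly positive joint pmf of (x, y, i); i plays the role of I_B."""
    weights = {(x, y, i): 1 + (3 * x + 5 * y + 7 * i + x * y * i) % 4
               for x, y, i in product(range(3), range(3), range(2))}
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}

def check_identity(joint):
    """Return I(X;Y), the averaged conditional terms, the triple term, and H(I_B)."""
    X, Y, I = (0,), (1,), (2,)
    i_xy = mutual_information(joint, X, Y)
    p_b, given_b = condition_on(joint, 2, 1)
    p_bc, given_bc = condition_on(joint, 2, 0)
    averaged = (p_b * mutual_information(given_b, X, Y)
                + p_bc * mutual_information(given_bc, X, Y))
    triple = i_xy - conditional_mi(joint, X, Y, I)
    return i_xy, averaged, triple, entropy(marginal(joint, I))

if __name__ == "__main__":
    print(check_identity(toy_joint()))
```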

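Proof of Theorem 1: Consider the event $B = (N_0 \le 2^n)$, where $N_0$ is the
level indicator of variable $Y_0$. On the one hand, by Markovianity of
$(Y_i)_{i\in\mathbb{Z}}$, we have

$$\begin{aligned} I(X_{-n+1}^{0}; X_{1}^{n}|B) &\le I(Y_{-n+1}^{0}; Y_{1}^{n}|B) \\ &\le I(Y_0; Y_1|B) \le H(Y_0|B). \end{aligned}$$

On the other hand, for $B^c$, the complement of $B$, we have

$$I(X_{-n+1}^{0}; X_{1}^{n}|B^c) \le H(X_{-n+1}^{0}|B^c) \le n \log |\mathbb{X}|,$$

where $|\mathbb{X}|$, the cardinality of set $\mathbb{X}$, is finite. Hence, using
(20), we obtain

$$\begin{aligned} E(n) &\le P(B)\,I(X_{-n+1}^{0}; X_{1}^{n}|B) + P(B^c)\,I(X_{-n+1}^{0}; X_{1}^{n}|B^c) + 1 \\ &\le P(B)\,H(Y_0|B) + n P(B^c) \log |\mathbb{X}| + 1, \end{aligned} \tag{21}$$

where

$$P(B) = \sum_{m=2}^{2^n} \frac{C}{m \log^{\alpha} m}.$$

Using (18) yields further

$$\begin{aligned} P(B)\,H(Y_0|B) &= P(B) \sum_{m=2}^{2^n} \frac{C}{P(B) \cdot m \log^{\alpha} m} \log \frac{r(m) \cdot P(B) \cdot m \log^{\alpha} m}{C} \\ &= \sum_{m=2}^{2^n} \frac{C}{m \log^{\alpha} m} \log \frac{r(m) \cdot m \log^{\alpha} m}{C} + P(B) \log P(B) \\ &= \Theta\left( \sum_{m=2}^{2^n} \frac{1}{m \log^{\alpha-1} m} \right) = \begin{cases} \Theta(n^{2-\alpha}), & \alpha \in (1, 2), \\ \Theta(\log n), & \alpha = 2. \end{cases} \end{aligned}$$

On the other hand, by (19), we have

$$n P(B^c) = n \sum_{m=2^n+1}^{\infty} \frac{C}{m \log^{\alpha} m} = \Theta(n^{2-\alpha}).$$

Plugging both bounds into (21) yields the requested bound (8). □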

Now we prove Propositions 1-3. The proofs are very similar and consist in
constructing variables $D_n$ that are both functions of $X_{-n+1}^{0}$ and functions
of $X_{1}^{n}$. Given this property, we obtain

$$\begin{aligned} E(n) &= I(X_{-n+1}^{0}, D_n; X_{1}^{n}) = I(D_n; X_{1}^{n}) + I(X_{-n+1}^{0}; X_{1}^{n}|D_n) \\ &= H(D_n) + I(X_{-n+1}^{0}; X_{1}^{n}|D_n). \end{aligned} \tag{22}$$

Hence, some lower bounds for the block mutual information $E(n)$ follow from
the respective bounds for the entropies of $D_n$.
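For the reader's convenience, the steps behind (22) can be spelled out as follows. This is only an expanded restatement using the chain rule for mutual information and the fact that $H(D_n|Z) = 0$ whenever $D_n$ is a function of $Z$.

```latex
\begin{align*}
E(n) &= I(X_{-n+1}^{0}; X_{1}^{n}) \\
     &= I(X_{-n+1}^{0}, D_n; X_{1}^{n})
        && \text{($D_n$ is a function of $X_{-n+1}^{0}$)} \\
     &= I(D_n; X_{1}^{n}) + I(X_{-n+1}^{0}; X_{1}^{n} \mid D_n)
        && \text{(chain rule)} \\
     &= H(D_n) - H(D_n \mid X_{1}^{n}) + I(X_{-n+1}^{0}; X_{1}^{n} \mid D_n) \\
     &= H(D_n) + I(X_{-n+1}^{0}; X_{1}^{n} \mid D_n)
        && \text{($D_n$ is a function of $X_{1}^{n}$)}.
\end{align*}
```

Since the conditional mutual information is nonnegative, the last line gives $E(n) \ge H(D_n)$.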



Proof of Proposition 1: Introduce random variable

$$D_n = \begin{cases} N_0, & 2N_0 \le n, \\ 0, & 2N_0 > n. \end{cases}$$

Equivalently, we have

$$D_n = \begin{cases} N_1, & 2N_1 \le n, \\ 0, & 2N_1 > n. \end{cases}$$



It can be seen that $D_n$ is both a function of $X_{-n+1}^{0}$ and a function of
$X_{1}^{n}$. On the one hand, observe that if $2N_0 \le n$ then we can identify
$N_0$ given $X_{-n+1}^{0}$ because the full period is visible in $X_{-n+1}^{0}$,
bounded by two delimiters 1. On the other hand, if $2N_0 > n$ then given
$X_{-n+1}^{0}$ we may conclude that the period's length $N_0$ exceeds $n/2$,
regardless of whether the whole period is visible or not. Hence variable $D_n$ is a
function of $X_{-n+1}^{0}$. In a similar way, we show that $D_n$ is a function of
$X_{1}^{n}$. Given both facts, we derive (22).
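The argument that $D_n$ can be read off either block may be illustrated by a short sketch. The function names are hypothetical; `d_from_block` looks for two consecutive delimiters 1 in a block of length $n$ and returns their distance if twice that distance does not exceed $n$, and 0 otherwise. The same function is applied to the past block and to the future block.

```python
def example1_block(level, phase, length):
    """Observable symbols of Example 1 on the cycle of the given level.

    `phase` is the 1-based index k of the hidden state at the first position.
    """
    out = []
    k = phase
    for _ in range(length):
        out.append(1 if k == level else 0)
        k = 1 if k == level else k + 1
    return out

def d_from_block(block):
    """Compute D_n from a block of n consecutive observables of Example 1."""
    n = len(block)
    ones = [i for i, x in enumerate(block) if x == 1]
    if len(ones) >= 2:
        period = ones[1] - ones[0]      # distance between two delimiters
        if 2 * period <= n:
            return period
    return 0                            # period longer than n/2 or not visible

if __name__ == "__main__":
    n = 8
    for level in (2, 3, 4, 5, 9):
        path = example1_block(level, 1, 2 * n)
        past, future = path[:n], path[n:]
        print(level, d_from_block(past), d_from_block(future))
```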

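Next, we bound the terms appearing on the right hand side of (22). For
a given $N_0$, variable $X_{-n+1}^{0}$ assumes at most $N_0$ distinct values, which
depend on $N_0$. Hence

$$H(X_{-n+1}^{0}|D_n = m) \le \log m \quad \text{for } 2 \le m \le \lfloor n/2 \rfloor.$$

On the other hand, if we know that $N_0 > n$ then the number of distinct values
of variable $X_{-n+1}^{0}$ equals $n + 1$. Consequently, if we know that $D_n = 0$,
i.e., $N_0 \ge \lfloor n/2 \rfloor + 1$, then the number of distinct values of
$X_{-n+1}^{0}$ is bounded above by

$$\begin{aligned} n + 1 + \sum_{m=\lfloor n/2 \rfloor + 1}^{n} m &= n + 1 + \frac{n(n+1)}{2} - \frac{\lfloor n/2 \rfloor (\lfloor n/2 \rfloor + 1)}{2} \\ &\le \frac{3n^2 + 14n + 8}{8} \le \frac{25}{8} n^2. \end{aligned}$$

In this way we obtain

$$H(X_{-n+1}^{0}|D_n = 0) \le \log(25n^2/8).$$

Hence, by (18) and (19), the conditional mutual information may be bounded as

$$\begin{aligned} I(X_{-n+1}^{0}; X_{1}^{n}|D_n) &\le H(X_{-n+1}^{0}|D_n) \\ &= \sum_{m=2}^{\lfloor n/2 \rfloor} P(D_n = m)\,H(X_{-n+1}^{0}|D_n = m) + P(D_n = 0)\,H(X_{-n+1}^{0}|D_n = 0) \\ &\le \sum_{m=2}^{\lfloor n/2 \rfloor} \frac{C \log m}{m \log^{\alpha} m} + \sum_{m=\lfloor n/2 \rfloor + 1}^{\infty} \frac{C \log(25n^2/8)}{m \log^{\alpha} m} \\ &= \begin{cases} \Theta(\log^{2-\alpha} n), & \alpha \in (1, 2), \\ \Theta(\log \log n), & \alpha = 2. \end{cases} \end{aligned}$$

The entropy of $D_n$ may be bounded similarly,

$$\begin{aligned} H(D_n) &= \sum_{m=2}^{\lfloor n/2 \rfloor} \frac{C}{m \log^{\alpha} m} \log \frac{m \log^{\alpha} m}{C} - P(D_n = 0) \log P(D_n = 0) \\ &= \begin{cases} \Theta(\log^{2-\alpha} n), & \alpha \in (1, 2), \\ \Theta(\log \log n), & \alpha = 2. \end{cases} \end{aligned}$$

Hence, because $E(n)$ satisfies (22), we obtain (11). □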

Proof of Proposition 2: Introduce random variable

$$D_n = \begin{cases} N_0, & 2s(N_0) \le n, \\ 0, & 2s(N_0) > n. \end{cases}$$

Equivalently, we have

$$D_n = \begin{cases} N_1, & 2s(N_1) \le n, \\ 0, & 2s(N_1) > n. \end{cases}$$

As in the previous proof, the newly constructed variable $D_n$ is both a
function of $X_{-n+1}^{0}$ and a function of $X_{1}^{n}$. If $2s(N_0) \le n$ then we
can identify $N_0$ given $X_{-n+1}^{0}$ because the full period is visible in
$X_{-n+1}^{0}$, bounded by two delimiters 2. If $2s(N_0) > n$ then given
$X_{-n+1}^{0}$ we may conclude that the period's length $s(N_0)$ exceeds $n/2$,
regardless of whether the whole period is visible or not. Hence variable $D_n$ is a
function of $X_{-n+1}^{0}$. In a similar way, we demonstrate that $D_n$ is a function
of $X_{1}^{n}$. By these two facts, we infer (22).

Observe that the largest $m$ such that
$s(m) = \lfloor \log m \rfloor + 1 \le \lfloor n/2 \rfloor$ is
$m = 2^{\lfloor n/2 \rfloor} - 1$. Using (18), the entropy of $D_n$ may be bounded as

$$\begin{aligned} H(D_n) &= \sum_{m=2}^{2^{\lfloor n/2 \rfloor} - 1} \frac{C}{m \log^{\alpha} m} \log \frac{m \log^{\alpha} m}{C} - P(D_n = 0) \log P(D_n = 0) \\ &= \begin{cases} \Theta(n^{2-\alpha}), & \alpha \in (1, 2), \\ \Theta(\log n), & \alpha = 2. \end{cases} \end{aligned}$$

Thus (14) follows by (22) and Theorem 1. □
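Proof of Proposition 3: Introduce random variable

$$D_n = \begin{cases} m, & Y_0 = \sigma_{mk},\ s(m) + 1 \le k \le 2s(m),\ 2s(m) \le n, \\ 0, & \text{else}. \end{cases}$$

Equivalently, we have

$$D_n = \begin{cases} m, & Y_1 = \sigma_{mk},\ s(m) + 2 \le k \le 2s(m) + 1,\ 2s(m) \le n, \\ 0, & \text{else}. \end{cases}$$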

Again, it can be seen that $D_n$ is both a function of $X_{-n+1}^{0}$ and a function
of $X_{1}^{n}$. The way of computing $D_n$ given $X_{-n+1}^{0}$ is as follows. If

$$X_{-n+1}^{0} = (\ldots, 2, b(m, 2), b(m, 3), \ldots, b(m, s(m)), \underbrace{3, \ldots, 3}_{l \text{ times}})$$

for some $m$ such that $2s(m) \le n$ and $1 \le l \le s(m)$ then we return $D_n = m$.
Otherwise we return $D_n = 0$. The recipe for $D_n$ given $X_{1}^{n}$ is
mirror-like. If

$$X_{1}^{n} = (\underbrace{3, \ldots, 3}_{l \text{ times}}, b(m, 2), b(m, 3), \ldots, b(m, s(m)), 2, \ldots)$$

for some $m$ such that $2s(m) \le n$ and $1 \le l \le s(m)$ then we return $D_n = m$.
Otherwise we return $D_n = 0$. In view of these observations we derive (22), as
in the previous two proofs.
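The two recipes can be sketched as a pair of decoders. All names below are hypothetical, and the test path is built by concatenating level blocks chosen by hand rather than sampled from the process; the sketch only illustrates that both decoders return the same value at every split point of such a path.

```python
def s(m):
    """Length of the binary expansion of m."""
    return m.bit_length()

def level_block(m):
    """Symbols emitted during one pass through level m in Example 3."""
    bits = [(m >> (s(m) - k)) & 1 for k in range(2, s(m) + 1)]
    return [2] + bits + [3] * (s(m) + 1) + bits

def bits_to_level(bits):
    """Decode m from b(m, 2), ..., b(m, s(m)); the leading digit b(m, 1) is 1."""
    value = 1
    for bit in bits:
        value = 2 * value + bit
    return value

def d_from_past(past):
    """D_n from X_{-n+1}^0: the block must end with (2, bits, 3 repeated l times)."""
    n = len(past)
    i = n
    while i > 0 and past[i - 1] == 3:
        i -= 1
    run = n - i                                   # number l of trailing 3s
    j = i
    while j > 0 and past[j - 1] in (0, 1):
        j -= 1
    if run == 0 or j == i or j == 0 or past[j - 1] != 2:
        return 0
    m = bits_to_level(past[j:i])
    return m if 2 * s(m) <= n and run <= s(m) else 0

def d_from_future(future):
    """D_n from X_1^n: the block must start with (3 repeated l times, bits, 2)."""
    n = len(future)
    i = 0
    while i < n and future[i] == 3:
        i += 1                                    # i is the number l of leading 3s
    j = i
    while j < n and future[j] in (0, 1):
        j += 1
    if i == 0 or j == i or j == n or future[j] != 2:
        return 0
    m = bits_to_level(future[i:j])
    return m if 2 * s(m) <= n and i <= s(m) else 0

if __name__ == "__main__":
    path = level_block(5) + level_block(6) + level_block(2)
    n = 6
    for idx in range(n - 1, len(path) - n):       # idx is the position of X_0
        past, future = path[idx - n + 1: idx + 1], path[idx + 1: idx + 1 + n]
        print(idx, d_from_past(past), d_from_future(future))
```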

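Now, for $m \ge 2$ and $s(m) \le n/2$, the distribution of $D_n$ is

$$P(D_n = m) = \frac{s(m)}{3s(m)} \cdot \frac{C}{m \log^{\alpha} m} = \frac{1}{3} \cdot \frac{C}{m \log^{\alpha} m}.$$

Notice that the largest $m$ such that
$s(m) = \lfloor \log m \rfloor + 1 \le \lfloor n/2 \rfloor$ is
$m = 2^{\lfloor n/2 \rfloor} - 1$. Hence, by (18), the bound for the entropy of $D_n$
is

$$\begin{aligned} H(D_n) &= \sum_{m=2}^{2^{\lfloor n/2 \rfloor} - 1} \frac{C}{3m \log^{\alpha} m} \log \frac{3m \log^{\alpha} m}{C} - P(D_n = 0) \log P(D_n = 0) \\ &= \begin{cases} \Theta(n^{2-\alpha}), & \alpha \in (1, 2), \\ \Theta(\log n), & \alpha = 2. \end{cases} \end{aligned}$$

Consequently, (14) follows by (22) and Theorem 1. □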